Introduction to sampling

python

datacamp

statistics

machine learning

sampling

Author

kakamana

Published

January 8, 2023


Get a better understanding of what sampling is and why it is so powerful. We will also learn about the problems associated with convenience sampling and the difference between true randomness and pseudo-randomness.

This article covers part of the DataCamp course: Introduction to sampling.

These notes record my experience of learning data science through DataCamp.

Code

# Import pandas, seaborn, and NumPy with their usual aliases
import pandas as pd
import seaborn as sns
import numpy as np

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

Sampling and point estimates

Population: the complete dataset.
Sample: the subset of the data that you calculate on.

Population parameter: a calculation made on the population dataset.

# Points vs. flavor: population
# (coffee_ratings is assumed to be an already-loaded DataFrame)
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]
np.mean(pts_vs_flavor_pop["total_cup_points"])

Point estimate (or sample statistic): a calculation made on the sample dataset.

# Points vs. flavor: 10-row sample
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)
cup_points_samp = coffee_ratings["total_cup_points"].sample(n=10)
np.mean(cup_points_samp)
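As a self-contained illustration of the two definitions above, the sketch below builds a small synthetic stand-in for coffee_ratings (the values are made up for illustration and are not the real dataset) and compares the population parameter with a point estimate from a 10-row sample. The seed and random_state values are arbitrary and only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for coffee_ratings: synthetic scores, not the real data
rng = np.random.default_rng(42)
coffee_ratings = pd.DataFrame({
    "total_cup_points": rng.normal(loc=82, scale=3, size=1000),
    "flavor": rng.normal(loc=7.5, scale=0.4, size=1000),
})

# Population parameter: the mean over every row of the dataset
pop_mean = coffee_ratings["total_cup_points"].mean()

# Point estimate: the same calculation on a 10-row simple random sample
cup_points_samp = coffee_ratings["total_cup_points"].sample(n=10, random_state=0)
samp_mean = cup_points_samp.mean()

print(f"Population mean:    {pop_mean:.3f}")
print(f"Sample mean (n=10): {samp_mean:.3f}")
```

With only 10 rows the point estimate can sit noticeably away from the population value; larger samples tend to land closer.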

Simple sampling with pandas

The purpose of this exercise is to explore Spotify song data. The population dataset has over 40,000 rows, each representing a song. The columns include the title of the song, the artists who performed it, the release year, and attributes of the song such as its duration, tempo, and danceability. To begin, we examine the durations.

We will sample the Spotify dataset and compare the mean duration of the sample with the mean duration of the population.

Code

# Load the population dataset from a local feather file
spotify_population = pd.read_feather("dataset/spotify_2000_2020.feather")
spotify_population.head()

	acousticness	artists	danceability	duration_ms	duration_minutes	energy	explicit	id	instrumentalness	key	liveness	loudness	mode	name	popularity	release_date	speechiness	tempo	valence	year
0	0.97200	['David Bauer']	0.567	313293.0	5.221550	0.227	0.0	0w0D8H1ubRerCXHWYJkinO	0.601000	10.0	0.110	-13.441	1.0	Shout to the Lord	47.0	2000	0.0290	136.123	0.0396	2000.0
1	0.32100	['Etta James']	0.821	360240.0	6.004000	0.418	0.0	4JVeqfE2tpi7Pv63LJZtPh	0.000372	9.0	0.222	-9.841	0.0	Miss You	51.0	2000-12-12	0.0407	117.382	0.8030	2000.0
2	0.00659	['Quasimoto']	0.706	202507.0	3.375117	0.602	1.0	5pxtdhLAi0RTh1gNqhGMNA	0.000138	11.0	0.400	-8.306	0.0	Real Eyes	44.0	2000-06-13	0.3420	89.692	0.4790	2000.0
3	0.00390	['Millencolin']	0.368	173360.0	2.889333	0.977	0.0	3jRsoe4Vkxa4BMYqGHX8L0	0.000000	11.0	0.350	-2.757	0.0	Penguins & Polarbears	52.0	2000-02-22	0.1270	165.889	0.5480	2000.0
4	0.12200	['Steve Chou']	0.501	344200.0	5.736667	0.511	0.0	4mronxcllhfyhBRqyZi8kU	0.000000	7.0	0.279	-9.836	0.0	黃昏	53.0	2000-12-25	0.0291	78.045	0.1130	2000.0

Code

# Sample 1000 rows from spotify_population
spotify_sample = spotify_population.sample(n=1000)

# Print the sample
print(spotify_sample)

       acousticness                                            artists  \
7874        0.66400                                    ['The Walters']   
2664        0.01020                    ['Beastie Boys', 'Fatboy Slim']   
1683        0.00241                                          ['batta']   
14491       0.11400            ['AJ Mitchell', 'Ava Max', 'Sam Feldt']   
34495       0.96800                                         ['Yiruma']   
...             ...                                                ...   
25541       0.85400  ['Andrew Lloyd Webber', 'Patrick Wilson', 'Emm...   
904         0.71900                                   ['Carl Carlton']   
26932       0.02910                                    ['Cory Asbury']   
30144       0.14900                              ['Twenty One Pilots']   
12676       0.35000                                  ['Grupo Intenso']   

       danceability  duration_ms  duration_minutes  energy  explicit  \
7874          0.747     151683.0          2.528050   0.422       0.0   
2664          0.650     248507.0          4.141783   0.942       0.0   
1683          0.389     145400.0          2.423333   0.988       0.0   
14491         0.732     193548.0          3.225800   0.850       0.0   
34495         0.287     218293.0          3.638217   0.292       0.0   
...             ...          ...               ...     ...       ...   
25541         0.194     294160.0          4.902667   0.119       0.0   
904           0.546     153947.0          2.565783   0.828       0.0   
26932         0.572     333386.0          5.556433   0.685       0.0   
30144         0.550     277013.0          4.616883   0.625       0.0   
12676         0.718     219960.0          3.666000   0.529       0.0   

                           id  instrumentalness  key  liveness  loudness  \
7874   70QqoQ3krRFUHfEzit7vjT          0.002770  7.0    0.3920   -10.008   
2664   2WGGxhsc2WtPNkhsXWVcYb          0.000000  1.0    0.1220    -6.609   
1683   5V5akuBxKpIlTUPaueNpyy          0.000615  6.0    0.3460    -1.949   
14491  2wenGTypSYHXl1sN1pNC7X          0.000002  1.0    0.0388    -5.999   
34495  3xr8COed4nPPn6XWZ0iCGr          0.978000  9.0    0.0900   -19.285   
...                       ...               ...  ...       ...       ...   
25541  5klrh466oGToybceGHPGAX          0.000737  1.0    0.1090   -20.926   
904    5i7rT8lbGzjj1n7TTXR5U8          0.030000  4.0    0.3720    -4.771   
26932  0rH0mprtecH3grD9HFM5AD          0.000000  6.0    0.0963    -7.290   
30144  4IN3imzEuTsiHO6tOwDQu5          0.000000  1.0    0.1610    -8.213   
12676  0l7q7H1zYiJ9XHVqim2Uwc          0.000000  7.0    0.1510    -8.769   

       mode                                           name  popularity  \
7874    1.0                                   Goodbye Baby        55.0   
2664    1.0  Body Movin' - Fatboy Slim Remix/2005 Remaster        48.0   
1683    0.0                                          chase        61.0   
14491   1.0   Slow Dance (feat. Ava Max) - Sam Feldt Remix        72.0   
34495   1.0                             River Flows in You        62.0   
...     ...                                            ...         ...   
25541   1.0                               All I Ask Of You        57.0   
904     1.0                               Everlasting Love        48.0   
26932   1.0                                  Reckless Love        71.0   
30144   0.0                                       Trapdoor        56.0   
12676   1.0                                         Y Volo        45.0   

      release_date  speechiness    tempo  valence    year  
7874    2015-10-20       0.0294  111.141   0.5950  2015.0  
2664    2005-01-01       0.0754  101.786   0.7850  2005.0  
1683    2016-07-27       0.1070  111.975   0.0996  2016.0  
14491   2019-10-25       0.0444  124.024   0.3720  2019.0  
34495   2011-12-09       0.0541  145.703   0.3460  2011.0  
...            ...          ...      ...      ...     ...  
25541   2004-12-10       0.0398   85.698   0.1400  2004.0  
904     2009-01-01       0.0394  121.418   0.4010  2009.0  
26932   2018-01-26       0.0356  110.698   0.2320  2018.0  
30144   2009-12-29       0.0399  149.927   0.3170  2009.0  
12676   2001-10-16       0.0325  142.069   0.9440  2001.0  

[1000 rows x 20 columns]

Code

# Calculate the mean duration in mins from spotify_population
mean_dur_pop = spotify_population['duration_minutes'].mean()

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp = spotify_sample['duration_minutes'].mean()

# Print the means
print(mean_dur_pop)
print(mean_dur_samp)
print("\n Notice that the mean song duration in the sample is similar, but not identical to the mean song duration in the whole population.")

3.8521519140900073
3.8048647333333334

 Notice that the mean song duration in the sample is similar, but not identical to the mean song duration in the whole population.
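The sample mean changes each time .sample() is run, because a different set of rows is drawn. The sketch below is a minimal illustration of that, using a synthetic duration_minutes column as a stand-in for the Spotify data (an assumption, so that the block runs without the feather file). Passing random_state fixes the draw, and different seeds give slightly different point estimates.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for spotify_population (hypothetical values, not the real data)
rng = np.random.default_rng(2023)
population = pd.DataFrame(
    {"duration_minutes": rng.gamma(shape=9.0, scale=0.43, size=40_000)}
)

mean_dur_pop = population["duration_minutes"].mean()

# The same random_state always selects the same rows
samp_a = population.sample(n=1000, random_state=1)
samp_b = population.sample(n=1000, random_state=1)
same_rows = samp_a.index.equals(samp_b.index)

# Different seeds give different samples, hence different point estimates
sample_means = [
    population.sample(n=1000, random_state=seed)["duration_minutes"].mean()
    for seed in range(5)
]

print(f"Population mean: {mean_dur_pop:.4f}")
print("Same rows with the same seed:", same_rows)
print("Sample means:", [round(m, 4) for m in sample_means])
```

Fixing random_state is useful when a result needs to be reproduced exactly; leaving it unset gives a fresh sample on every run.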

Simple sampling and calculating with NumPy

Code

# Subset the loudness column of spotify_population
loudness_pop = spotify_population['loudness']

# Sample 100 values of loudness_pop
loudness_samp = loudness_pop.sample(n=100)

# Print the sample
print(loudness_samp)

28889   -6.697
41542   -3.166
24096   -8.327
38021   -5.596
29360   -3.779
         ...  
34958   -7.206
41513   -7.560
28413   -7.113
41429   -6.073
38857   -4.433
Name: loudness, Length: 100, dtype: float64

Code

# Calculate the mean of loudness_pop
mean_loudness_pop = np.mean(loudness_pop)

# Calculate the mean of loudness_samp
mean_loudness_samp = np.mean(loudness_samp)

# Print the means
print(mean_loudness_pop)
print(mean_loudness_samp)
print("\n Again, notice that the calculated value (the mean) is close but not identical in each case")

-7.366856851353947
-7.385839999999999

 Again, notice that the calculated value (the mean) is close but not identical in each case

Convenience sampling

Convenience sampling means collecting data by the easiest method available.

Sample bias: the sample is not a true representation of the population. When the bias comes from how the rows were selected, it is called selection bias.

Are findings from the sample generalizable?

Convenience sampling, the collection of data using the simplest method, can lead to samples that are not representative of the population. In that case, the findings from the sample cannot be generalized to the entire population. One way to check whether a sample is representative is to compare the distribution of the sample with the distribution of the population.
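To make the bias concrete, here is a small sketch on synthetic data (the column names and values are hypothetical, not the Spotify dataset). The table is sorted by year and the made-up acousticness value drifts downward over time, so taking the first rows with head(), a typical convenience sample, over-represents the oldest songs. A random sample drawn with sample() picks rows from across the whole table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 10_000

# Made-up population: rows sorted by year, acousticness declining with year
year = np.sort(rng.integers(2000, 2021, size=n))
acousticness = np.clip(
    0.6 - 0.02 * (year - 2000) + rng.normal(0, 0.1, size=n), 0, 1
)
songs = pd.DataFrame({"year": year, "acousticness": acousticness})

pop_mean = songs["acousticness"].mean()

# Convenience sample: simply the first 500 rows (all from the earliest years)
convenience_mean = songs.head(500)["acousticness"].mean()

# Simple random sample of the same size
random_mean = songs.sample(n=500, random_state=7)["acousticness"].mean()

print(f"Population mean:         {pop_mean:.3f}")
print(f"Convenience sample mean: {convenience_mean:.3f}")
print(f"Random sample mean:      {random_mean:.3f}")
```

In this construction the convenience sample overestimates the population mean, while the random sample lands close to it.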

Code

# Visualize the distribution of acousticness with a histogram
width = 0.01
spotify_population['acousticness'].hist(bins=np.arange(0,1.01,width))
plt.show()

Code

# Draw a simple random sample of 1107 rows
spotify_mysterious_sample = spotify_population.sample(n=1107)
# Update the histogram to use spotify_mysterious_sample
spotify_mysterious_sample['acousticness'].hist(bins=np.arange(0, 1.01, 0.01))
plt.show()

Code

# Visualize the distribution of duration_minutes as a histogram
spotify_population['duration_minutes'].hist(bins=np.arange(0,15.5,0.5))
plt.show()

Code

# Draw a simple random sample of 50 rows
spotify_mysterious_sample2 = spotify_population.sample(n=50)
# Update the histogram to use spotify_mysterious_sample2
spotify_mysterious_sample2['duration_minutes'].hist(bins=np.arange(0, 15.5, 0.5))
plt.show()

Pseudo-random number generation

Code

# Generate random numbers from a Uniform(-3, 3)
uniforms = np.random.uniform(low=-3, high=3, size=5000)

# Print uniforms
print(uniforms)

# Plot a histogram of uniform values, binwidth 0.25
plt.hist(uniforms, bins=np.arange(-3,3.25,0.25))
plt.show()

[ 0.46978238 -1.66176314  1.31080161 ...  0.27384823  0.57683707
  1.94834767]

Code

# Generate random numbers from a Normal(5, 2)
normals = np.random.normal(loc=5, scale=2, size=5000)

# Print normals
print(normals)

# Plot a histogram of normal values, binwidth 0.5
plt.hist(normals, bins=np.arange(-2, 13.5, 0.5))
plt.show()

[0.37088348 5.46510043 5.6804744  ... 5.06268094 7.1989964  4.26078515]
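The numbers generated above are pseudo-random: they come from a deterministic algorithm started from a seed, so the same seed reproduces the same sequence. A short sketch using np.random.seed, with arbitrary seed values of 123 and 456 chosen only for illustration:

```python
import numpy as np

# Same seed, same "random" numbers
np.random.seed(123)
first = np.random.normal(loc=5, scale=2, size=3)

np.random.seed(123)
second = np.random.normal(loc=5, scale=2, size=3)

# A different seed starts a different sequence
np.random.seed(456)
third = np.random.normal(loc=5, scale=2, size=3)

print("first == second:", np.array_equal(first, second))
print("first == third: ", np.array_equal(first, third))
```

np.random.seed matches the np.random.uniform and np.random.normal calls used in this post. For new code, NumPy's documentation recommends creating a generator with np.random.default_rng(seed) instead.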